Refactor genetics calculations to allow import to be triggered separately #650
Conversation
…re calculated separately
- Make R scripts exit immediately on error
- Bugfix to expected kinship/validation
…y computed kinship data
@dnicolalde: as part of touching the kinship scripts, I'd like to address known issues. One limitation is that there are instances where a given animal is a hybrid of 2 species. As it stands, the kinship script runs species-by-species, which is clearly a problem there. Let's say there is a Rhesus macaque with a sire who was a pigtail. I don't think just adding the one missing sire to the rhesus dataframe is adequate, since in theory there are more ancestors that could affect this (granted, that's probably rather rare). One possible solution might be to update the R code to identify animals with a cross-species parent, and then merge these species together and run kinship on the two species together. If IDs were unique between species, the only downside is memory. When these scripts were written in ~2010, memory was more limiting than it probably is today. I think ONPRC has a couple of legacy hybrid animals and I will give this a look with those.
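The merging idea described above could be sketched roughly as follows. This is a minimal illustration in Python rather than the actual R script; the function name and the Id/Sire/Dam/Species fields are hypothetical stand-ins for the pedigree.sql columns:

```python
from collections import defaultdict

def merge_hybrid_species(pedigree):
    """Sketch: partition animals into kinship batches, merging any species
    that are linked by a cross-species parent (a hybrid animal).

    pedigree: list of dicts with keys Id, Sire, Dam, Species
    (illustrative names, not the real script's schema)."""
    species_of = {row["Id"]: row["Species"] for row in pedigree}

    # Union-find over species labels
    parent = {}
    def find(s):
        parent.setdefault(s, s)
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path compression
            s = parent[s]
        return s
    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Any animal with a parent of a different species links the two species
    for row in pedigree:
        for p in (row["Sire"], row["Dam"]):
            if p and p in species_of and species_of[p] != row["Species"]:
                union(row["Species"], species_of[p])

    # One kinship run per merged group, instead of strictly per species
    groups = defaultdict(list)
    for row in pedigree:
        groups[find(row["Species"])].append(row["Id"])
    return dict(groups)
```

With this shape, a rhesus animal with a pigtail sire pulls both species into one batch, so all shared ancestors are visible to the kinship calculation, while unrelated species still run separately (keeping the memory cost down).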
- Refactor kinship script to optionally merge species where hybrids are present and process together
A couple of changes of note just added:
I've been testing this using the full ONPRC pedigree. It's fairly easy to run against another pedigree if any center is willing to share the results of pedigree.sql as a TSV.
I looked at your changes yesterday. I ran them against our pedigree data and shared the results. I would like to verify that the generated data is not changing that much.
OK. Other than hybrids I would expect zero changes in coefficients at all, so I'd be interested to know how that looks.
Hi Ben, I compared the results from the new populateKinship.r script with the results on our production server and found there were no differences in the number of pairs or the coefficients generated. The only change I saw was the addition of the species column in the results.
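A comparison like the one described here can be done by keying both outputs on the ID pair and diffing the values. A minimal sketch, assuming a TSV layout with Id, Id2, and coefficient columns (the actual schema may differ):

```python
import csv

def load_kinship(path):
    """Sketch: read a kinship TSV into a {(Id, Id2): coefficient} dict.
    Column names are assumptions, not the real file's header."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return {(r["Id"], r["Id2"]): float(r["coefficient"]) for r in reader}

def compare_kinship(old, new, tol=1e-9):
    """Report pairs unique to either run and pairs whose coefficient moved."""
    only_old = old.keys() - new.keys()
    only_new = new.keys() - old.keys()
    changed = {k: (old[k], new[k])
               for k in old.keys() & new.keys()
               if abs(old[k] - new[k]) > tol}
    return only_old, only_new, changed
</imports>```

For the "zero changes expected" case above, all three results would be empty; any non-empty result immediately names the pairs to investigate.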
@labkey-martyp and @labkey-jeckels: any comments on this?
@bbimber I'm still testing this out. I tried this on one full set of client data and this PR was missing some of the kinship relationships. I have not had time to investigate, but I would guess there are some assumptions made about the data, possibly null or unknown values, that are affecting the results. I can dig more and get more info by the end of the week. But it does look like the validation in the R script is using the same assumptions, as it did not catch those missing relationships, where the current validation, using a different query, did pick them up. If you're blocked on this for setting up your cluster calculations, it might make sense to split this into two PRs: one with just the import functionality and the other with the R script refactors.
I don't suppose you're able to share a pedigree file? Is it TNPRC or JHU? The one difference between the R-based and java-based checks is that the R-based check relies on the data that makes it into pedigree.sql. I noticed your java code pulled from demographicsFamily/demographicsOffspring/demographicsSiblings. Those are independent SQL scripts and therefore could easily be applying subtly different filters. This could either be the combination one or the other uses for including or not including supplemental pedigree, or whatever sources of data might exist. Simply looking at row counts and/or SQL joins on the results of pedigree.sql and those 3 might be informative.
@labkey-martyp: to follow up on this specific sentence you wrote: But it does look like the validation in the R script is <not?> using the same assumptions as it did not catch those missing relationships, where the current validation, using a different query, did pick them up. If you are intending to say that the kinship pedigree/script does not catch a certain relationship, but the java validation process does catch it, and this is explicitly because they use different SQL queries as input, then I think I would ask the question: "what exactly are we trying to catch here"? If the actual difference is due to different SQL queries returning different results, then wouldn't it be a heck of a lot more direct to do a SQL test of pedigree.sql against those queries and ask whether discrepancies should exist? If there are, should the respective SQL get updated? Said another way, I think it's important to separate the questions of "are we giving R the identical/complete input data" and "does the kinship package give the expected results".
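The direct SQL-level check proposed here amounts to a set difference once each query's result is exported. A hypothetical sketch, assuming each source is exported as (animal, relative, relationship) tuples; any discrepancy then points at the SQL inputs rather than the kinship package:

```python
def relationship_diff(pedigree_rows, demographics_rows):
    """Sketch: diff pedigree.sql output against the combined
    demographicsFamily/Offspring/Siblings results (shapes assumed)."""
    ped, demo = set(pedigree_rows), set(demographics_rows)
    return {
        "in_pedigree_only": sorted(ped - demo),
        "in_demographics_only": sorted(demo - ped),
    }
```

If both sides come back empty, the remaining question is purely whether the kinship package produces the expected coefficients from identical input.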
@bbimber I was able to set this code up in a test instance of our EHR and it looks like there might be a problem. The file generated on our production server is around 786MB and the one this new code generated is 417MB. It is adding the data into the corresponding dataset, and I will be able to see how many rows I am missing. Any idea why it would create a file half the size? Both started with the same Pedigree table, which contains 20,316 rows.
Follow-up: it finished importing and it only generated 12,924,820 rows, in comparison to the current production server, which generates 28,219,394. I can work with Marty to see where it is having problems with the calculation.
Thanks for running this. I have some fairly straightforward checks where we compare the data at each step, which would isolate what is or isn't changed. Running with or without the flag to handle hybrids (which is the primary intentional difference) would also be a fast way to determine whether this is the source of a difference.
This PR contains two main categories of changes. The motivation of these changes was to separate the process of computing the data for EHR genetics (e.g. kinship and inbreeding) from actually importing the data into the EHR. In total, the idea is that PRIMe-seq (which already has a mirrored copy of pedigree data) will execute the default EHR GeneticsCalculation pipeline. It will farm the computation out to our cluster, which that server is already configured to do. When complete, this pipeline already saves the results as TSV files. I wrote a separate ETL, which will be defined in ONPRC modules, that copies the resulting TSVs to a location visible to PRIMe and then pings PRIMe via a new server-side action, causing PRIMe to take those TSVs and call EHRService.standaloneProcessKinshipAndInbreeding to actually import them.
The changes within EHR itself primarily refactor the portions of the code that import the TSVs: making them static, allowing them to be called separately, and exposing this through EHRService. These changes are essentially a refactor without touching much of the logic itself.
When I got into the weeds of the R code, I noticed a number of other things that seemed worth cleaning up. These are not directly related to the importing of data, but should be broadly useful: